
MoE multi-chip experts example#720

Open
puddingfjz wants to merge 2 commits into hw-native-sys:main from puddingfjz:moe_distributed_demo

Conversation

@puddingfjz puddingfjz commented May 8, 2026

Summary

This PR adds a focused L3 example for a distributed MoE-style workflow with
one expert per chip.

The example demonstrates:

  • L3 multi-chip worker setup with HCCL bootstrap windows
  • Cross-rank dispatch of per-expert token slices
  • Per-rank expert compute on the dispatched recv buffer
  • Cross-rank combine back to each source rank's output tensor
  • A pytest wrapper for running the end-to-end hardware case

The pipeline is intentionally small (NUM_TOKENS = 10, HIDDEN_DIM = 16,
COUNT = 4) so the data movement is easy to inspect while still exercising
dispatch, compute, and combine across chips.
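The dispatch → compute → combine round trip can be simulated in a single process; this is a NumPy sketch of the data movement only (the routing table and buffer names here are illustrative, not the example's actual HCCL implementation):

```python
import numpy as np

NUM_TOKENS = 10   # tokens per rank, from the PR description
HIDDEN_DIM = 16
NUM_RANKS = 4     # one expert per chip

rng = np.random.default_rng(0)
tokens = rng.standard_normal((NUM_RANKS, NUM_TOKENS, HIDDEN_DIM))
# Hypothetical routing: each token is assigned to exactly one expert rank.
routing = rng.integers(0, NUM_RANKS, size=(NUM_RANKS, NUM_TOKENS))

# Dispatch (all-to-all): collect every token routed to expert e onto rank e,
# remembering each token's source slot for the combine step.
recv = [[] for _ in range(NUM_RANKS)]
origin = [[] for _ in range(NUM_RANKS)]
for src in range(NUM_RANKS):
    for t in range(NUM_TOKENS):
        e = routing[src, t]
        recv[e].append(tokens[src, t])
        origin[e].append((src, t))

# Compute: the example's placeholder expert is a simple +1.0.
out = [[tok + 1.0 for tok in bucket] for bucket in recv]

# Combine (all-to-all): scatter each result back to its source rank's slot.
result = np.empty_like(tokens)
for e in range(NUM_RANKS):
    for (src, t), val in zip(origin[e], out[e]):
        result[src, t] = val

# Every token came back to its own slot, transformed once.
assert np.allclose(result, tokens + 1.0)
```

Because the placeholder expert is a pure elementwise +1.0, the end-to-end result is checkable against `tokens + 1.0` regardless of how tokens were routed, which is what makes the tiny sizes easy to inspect.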

Testing

  • conda run -n simpler_issue python3 -m py_compile examples/workers/l3/moe_multi_chip_experts/main.py examples/workers/l3/moe_multi_chip_experts/test_moe_multi_chip_experts.py
  • Hardware pytest through task-submit on devices 9,10: 1 passed in 8.50s

Test Configuration:
- 4 experts (one per chip)
- 10 tokens in context
- 4 tokens processed per expert
- Hidden dimension: 16

IMPORTANT: Current implementation tests DATA FLOW only, not actual MoE computation:
- Compute phase is a simple +1.0 operation, not expert network computation
- Focus is on verifying correct token routing and result gathering
- Can be extended to add real expert models later
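A minimal sketch of what "extend to real expert models later" could look like: the +1.0 placeholder and a hypothetical two-layer FFN expert share the same shape contract, so they are drop-in replacements for one another (function names and weights here are illustrative):

```python
import numpy as np

def placeholder_expert(x):
    # The PR's current compute phase: exercises routing, not real math.
    return x + 1.0

def ffn_expert(x, w1, w2):
    # Hypothetical replacement: a ReLU FFN expert with the same in/out shape.
    return np.maximum(x @ w1, 0.0) @ w2

HIDDEN_DIM = 16
rng = np.random.default_rng(1)
x = rng.standard_normal((4, HIDDEN_DIM))   # 4 tokens dispatched to this expert
w1 = rng.standard_normal((HIDDEN_DIM, 4 * HIDDEN_DIM))
w2 = rng.standard_normal((4 * HIDDEN_DIM, HIDDEN_DIM))

# Both experts map (tokens, HIDDEN_DIM) -> (tokens, HIDDEN_DIM),
# so swapping them leaves dispatch and combine untouched.
assert placeholder_expert(x).shape == x.shape
assert ffn_expert(x, w1, w2).shape == x.shape
```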

Core Components:
- Kernels: dispatch (all-to-all), compute (+1.0), combine (all-to-all)
- Orchestration: end2end, dispatch-only, combine-only, dispatch+compute
- Unit Tests: test_dispatch_only, test_combine_only, test_dispatch_compute
- E2E Test: test_end2end with unique value tracing
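The unique-value-tracing idea behind test_end2end can be sketched as a pytest-style check (the test and helper names below are illustrative, not the PR's actual API): stamp every (rank, token) slot with a distinct value, so any misrouted or stale token shows up as a mismatch after combine.

```python
import numpy as np

def run_end2end_sim(tokens):
    # Stand-in for the hardware end2end path: dispatch -> +1.0 -> combine.
    # On hardware this would round-trip tokens across chips.
    return tokens + 1.0

def test_end2end_unique_value_tracing():
    ranks, num_tokens, hidden = 4, 10, 16
    # One unique stamp per (rank, token) slot, broadcast along hidden dim.
    stamps = np.arange(ranks * num_tokens, dtype=float)
    tokens = stamps.reshape(ranks, num_tokens, 1).repeat(hidden, axis=2)
    out = run_end2end_sim(tokens)
    # Each slot must return exactly its own stamp + 1.0; a swap or a
    # stale buffer read would produce the wrong stamp somewhere.
    np.testing.assert_allclose(out, tokens + 1.0)
```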

KEY DESIGN: Use INDEPENDENT scratch_test buffer for combine phase
- Problem: Reusing scratch caused combine to read stale dispatch data
- Solution: Dispatch+Compute use scratch, Combine uses scratch_test
- Prevents corruption when combine's stage-in doesn't fully overwrite
  dispatch's data (it writes only 4 token rows, with a stride laid out for
  NUM_TOKENS = 10)
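The stale-data hazard can be demonstrated with a small buffer model (a sketch; the buffer names mirror the PR's scratch/scratch_test but the layout details are assumptions):

```python
import numpy as np

NUM_TOKENS, HIDDEN_DIM, PER_EXPERT = 10, 16, 4

# Dispatch+compute fill the whole shared scratch region with staging data.
scratch = np.full((NUM_TOKENS, HIDDEN_DIM), -1.0)  # -1.0 = dispatch leftovers

# Combine's stage-in writes only PER_EXPERT token rows into a region
# laid out for NUM_TOKENS rows, so rows 4..9 are never overwritten.
combine_in = np.ones((PER_EXPERT, HIDDEN_DIM))
scratch[:PER_EXPERT] = combine_in
assert (scratch[PER_EXPERT:] == -1.0).all()  # stale dispatch data survives

# The PR's fix: give combine an independent buffer so any row it reads
# beyond the staged tokens holds known-clean data, not dispatch leftovers.
scratch_test = np.zeros((NUM_TOKENS, HIDDEN_DIM))
scratch_test[:PER_EXPERT] = combine_in
assert (scratch_test[PER_EXPERT:] == 0.0).all()
```

The alternative of zeroing the shared scratch between phases would also work, but a separate buffer keeps the phases independent at the cost of a small, fixed amount of extra memory.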

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@puddingfjz puddingfjz force-pushed the moe_distributed_demo branch 6 times, most recently from 94fe535 to 1fb325a on May 8, 2026 08:16
- Keep the example focused on the end-to-end dispatch, compute, and combine path
- Remove obsolete debug docs, partial tests, and unused kernel variants
- Align README, test naming, and scratch buffer handling with the current two-chip hardware test
@puddingfjz puddingfjz force-pushed the moe_distributed_demo branch from 1fb325a to d47f536 on May 8, 2026 09:27